# CS61c Fall 2021 Lecture 14 Pipelining RISC-V



### Review

Computer Science 61C Fall 2021

Wawrzynek and Weave

#### Controller

Tells universal datapath how to execute each instruction

#### Instruction timing

- Set by instruction complexity, architecture, technology
- Pipelining increases clock frequency, "instructions per second"
  - But does not reduce time to complete instruction

#### Performance measures

- Different measures depending on objective
  - Response time
  - Jobs / second
  - Energy Efficiency (joules/operation, joules/instruction)



### Processor

Computer Science 61C Fall 2021 Wawrzynek and Weaver





### Pipelining Overview (Review)



- Pipelining doesn't help *latency* of single task, it helps *throughput* of entire workload
- Multiple tasks operating simultaneously using different resources
- Potential speedup = Number pipe stages
- Time to "fill" pipeline and time to "drain" it reduces speedup:
  - 2.3X v. 4X in this example
  - With lots of laundry, approaches 4X



#### Agenda

Computer Science 61C Fall 2021 Wawrzynek and Weaver

- RISC-V Pipeline
- Pipeline Control
- Hazards
  - Structural
  - Data
  - R-type instructions
  - Load
  - Control
- Superscalar processors



5

### Recap: Pipelining with RISC-V

Computer Science 61C Fall 2021

Wawrzynek and Weaver

instruction sequence

add t0, t1, t2

or t3, t4, t5

sll t6, t0, t3



|                                            | Single Cycle                          | Pipelining             |
|--------------------------------------------|---------------------------------------|------------------------|
| Timing                                     | $t_{step} = 100 \dots 200 \text{ ps}$ | $t_{cycle}$ = 200 ps   |
|                                            | Register access only 100 ps           | All cycles same length |
| Instruction time, t <sub>instruction</sub> | $= t_{cycle} = 800 \text{ ps}$        | 1000 ps                |
| Clock rate, $f_s$                          | 1/800 ps = 1.25 GHz                   | 1/200 ps = 5 GHz       |
| Relative speed                             | 1 x                                   | 4 x                    |



sw t0, 4(t3)

lw t0, 8(t3)

addi t2, t2, 1



sequence

### Single-Cycle RISC-V RV32I Datapath

Computer Science 61C Fall 2021 Wawrzynek and Weaver





#### Pipelining RISC-V RV32I Datapath





#### Pipelined RISC-V RV32I Datapath

Computer Science 61C Fall 2021 Wawrzynek and Weaver





Must pipeline instruction along with data, so control operates correctly in each stage

## Each stage operates on different instruction





Pipeline registers separate stages, hold data for each instruction in flight

#### Agenda

Computer Science 61C Fall 2021 Wawrzynek and Weaver

- RISC-V Pipeline
- Pipeline Control
- Hazards
  - Structural
  - Data
  - R-type instructions
  - Load
  - Control
- Superscalar processors



#### Pipelined Control

Computer Science 61C Fall 2021

Wawrzynek and Weaven

- Control signals derived from instruction
  - As in single-cycle implementation
  - Information is stored in pipeline registers for use by later stages





1 / throughput

Wawrzynek and Weaver

Computer Science 61C Fall 2021

#### Pipelining the single-cycle processor can increase processor performance by:

|   | Instructions<br>/program | Cycles/<br>instruction | Time/cycle |
|---|--------------------------|------------------------|------------|
| A | decrease                 | decrease               | same       |
| B | same                     | increase               | decrease   |
| C | same                     | same                   | decrease   |
| D | increase                 | decrease               | increase   |



#### Hazards Ahead

Computer Science 61C Fall 2021 Wawrzynek and Weaver





#### Agenda

Computer Science 61C Fall 2021

Wawrzynek and Weaver

- RISC-V Pipeline
- Pipeline Control
- Hazards
  - Structural ⇒ Two instructions compete for hard ware
  - Data
  - R-type instructions
  - Load
  - Control
- Superscalar processors



#### Structural Hazard

Computer Science 61C Fall 2021

Wawrzynek and We

- **Problem:** Two or more instructions in the pipeline compete for access to a single physical resource
- **Solution 1:** Instructions take turns to use resource, some instructions have to stall
- Solution 2: Add more hardware to machine
  - Can always solve a structural hazard by adding more hardware



#### Regfile Structural Hazards

Computer Science 61C Fall 2021 Wawrzynek and Weav

- Each instruction:
  - can read up to two operands in decode stage
  - can write one value in writeback stage
  - therefore two different instructions might be accessing the register file on the same cycle!
- Avoid structural hazard by having separate "ports"
  - two independent read ports and one independent write port
- Reads from one instruction and writes from another can happen simultaneously



### Structural Hazard: Memory Access

**Computer Science 61C Fall 2021** 

Wawrzynek and Weaver

add t0, t1, t2

sw t0, 4(t5)

slt t6, t0, t3

or t3, t4, t5

lw t0, 8(t3)



instruction

sequence

#### Instruction and Data Caches

Computer Science 61C Fall 2021 Wawrzynek and Weaver



Caches: relatively small and fast "buffer" memories



#### Structural Hazards – Summary

Computer Science 61C Fall 2021 Wawrzynek and Weave

- Conflict for use of a resource
- In RISC-V pipeline with a single memory
  - Load/store requires data access
  - Without separate memories, instruction fetch would have to stall for that cycle
    - All other operations in pipeline would have to wait
- Pipelined datapaths require separate instruction/data memories
  - Or at least separate instruction/data caches
- Multi-ported register file
- RISC ISAs (including RISC-V) designed to avoid structural hazards
  - e.g. at most one memory access/instruction
  - limited operands per instruction



#### Agenda

Computer Science 61C Fall 2021 Wawrzynek and Weaver

- RISC-V Pipeline
- Pipeline Control
- Hazards
  - Structural
  - Data
  - R-type instructions
  - Load
  - Control
- Superscalar processors



### Data Hazard: Register Access

Computer Science 61C Fall 2021

Wawrzynek and Weaver

- Separate ports, but what if write to same value as read?
- Does sw in the example fetch the old or new value?

add t0, t1, t2

or t3, t4, t5

slt t6, t4, t3

sw t0, 4(t3)

lw t0, 8(t3)





instruction

sequen

### Register Access Policy

Computer Science 61C Fall 2021

Wawrzynek and Weaver

add t0, t1, t2

or t3, t4, t5

slt t6, t4, t3

sw t0, 4(t3)

lw t0, 8(t3)



- Exploit high speed of register file (100 ps)
  - 1) WB updates value
  - 2) ID reads new value
  - Indicated in diagram by shading



Might not always be possible to write then read in same cycle, especially in high-frequency designs.



instruction

sequence

#### Data Hazard: ALU Result

Computer Science 61C Fall 2021

Wawrzynek and Weaver





#### Solution 1: Stalling

Computer Science 61C Fall 2021

Wawrzynek and Weaver

Problem: Instruction depends on result from previous instruction

add s0, t0, t1
 sub t2, s0, t3



- Bubble:
  - stall the dependent instruction
  - effectively NOP: affected pipeline stages do "nothing"



#### Stalls and Performance

Computer Science 61C Fall 2021 Wawrzynek and Weave

- Stalls reduce performance
  - But stalls might be required to get correct results
- Compiler could try to arrange code to avoid hazards and stalls
  - Requires knowledge of the pipeline structure



#### Solution 2: Forwarding

**Computer Science 61C Fall 2021 Wawrzynek and Weaver** 

Value of t0 add t0, t1, t2 or t3, t0, t5 instruction sequenc sub t6, t0, t3 xor t5, t1, t0

sw t0, 8(t3)



Forwarding: grab operand from pipeline stage, rather than register file



### Forwarding (aka Bypassing)

Computer Science 61C Fall 2021

Wawrzynek and Weaver

- Access result before it is stored in a register
- Requires extra connections in the datapath





# 1) Detect Need for Forwarding (example)

Computer Science 61C Fall 2021

Wawrzynek and Weave

add t0, t1, t2

or t3, t0, t5

sub t6, t0, t3



### Example Forwarding Path

Computer Science 61C Fall 2021

Wawrzynek and Weaver





#### Agenda

Computer Science 61C Fall 2021 Wawrzynek and Weaver

- RISC-V Pipeline
- Pipeline Control
- Hazards
  - Structural
  - Data
  - R-type instructions
  - Load
  - Control
- Superscalar processors



#### Load Data Hazard

Computer Science 61C Fall 2021 Wawrzynek and Weaver





#### Stall Pipeline

Computer Science 61C Fall 2021

Wawrzynek and Weaver





#### 1w Data Hazard

Computer Science 61C Fall 2021

Wawrzynek and Wea

- Slot after a load is called a load delay slot
  - If that instruction uses the result of the load, then the hardware will stall for one cycle
  - Equivalent to inserting an explicit nop in the slot
    - except the latter uses more code space
  - Performance loss!
- Idea:
  - Put unrelated instruction into load delay slot
  - No performance loss!



#### Code Scheduling to Avoid Stalls

Computer Science 61C Fall 2021 Wawrzynek and Weave

- Reorder code to avoid use of load result in the next instruction!
- RISC-V code for D=A+B; E=A+C;



# Agenda

Computer Science 61C Fall 2021 Wawrzynek and Weaver

- RISC-V Pipeline
- Pipeline Control
- Hazards
  - Structural
  - Data
  - R-type instructions
  - Load
  - Control
- Superscalar processors



#### Control Hazards

**Computer Science 61C Fall 2021** 

Wawrzynek and Weaver

beq t0, t1, label

sub t2, s0, t5

or t6, s0, t3

xor t5, t1, s0

sw s0, 8(t3)





#### Observation

Computer Science 61C Fall 2021

 If branch not taken, then instructions fetched sequentially after branch are correct

 If branch or jump taken, then need to flush incorrect instructions from pipeline by converting to NOPs



Wawrzynek and Weaver

### Kill Instructions after Branch if Taken

Computer Science 61C Fall 2021 Wawrzynek and Weaver

beq t0, t1, label

sub t2, s0, t5

or t6, s0, t3

label: xxxxxx





# Reducing Branch Penalties

Computer Science 61C Fall 2021

Wawrzynek and Weave

- Every taken branch in simple pipeline costs 2 dead cycles
   To improve performance, use "branch prediction" to guess
- To improve performance, use "branch prediction" to guess which way branch will go earlier in pipeline
- Only flush pipeline if branch prediction was incorrect



### Branch Prediction

Computer Science 61C Fall 2021 Wawrzynek and Weaver

Taken branch beq t0, t1, label Guess next PC! label: ..... Check guess correct



## Implementing Branch Prediction...

Computer Science 61C Fall 202

Wawrzynek and Weave

- This is a CS152 topic, but some ideas:
  - Branch prediction is critical for performance on deeper pipelines/superscalar as the "Misprediction penalty" is vastly higher
- Keep a branch prediction buffer/cache: Small memory addressed by the lowest bits of PC
  - During instruction decode, if branch: Look up whether branch was taken last time?
    - If yes, compute PC + offset and fetch that (or store actual branch target address from last time)
    - If no, stick with PC + 4
  - If branch hasn't been seen before
    - Assume forward branches are not taken, backward branches are taken
  - Update state on predictor with results of branch when it is finally calculated



# Agenda

Computer Science 61C Fall 2021 Wawrzynek and Weaver

- RISC-V Pipeline
- Pipeline Control
- Hazards
  - Structural
  - Data
  - R-type instructions
  - Load
  - Control
- Superscalar processors 超标量处理器.



## Increasing Single Processor Core Performance

Computer Science 61C Fall 2021

Wawrzynek and Weave

#### 1. Clock rate

- Limited by technology and power dissipation
- 2. Pipelining
  - "Overlap" instruction execution
  - Deeper pipeline: 5 => 10 => 15 stages

  - But more potential for hazards (CPI > 1)
  - Depth limited by max clock rate and power dissipation
- 3. Multi-issue "superscalar" processor



## Superscalar Processor

Computer Science 61C Fall 2021

Wawrzynek and Weave

- Multiple issue "superscalar"
  - Replicate pipeline stages ⇒ multiple pipelines
  - Start multiple instructions per clock cycle
  - CPI < 1, so use Instructions Per Cycle (IPC)</li>
  - E.g., 4GHz 4-way multiple-issue
    - 16 BIPS, peak CPI = 0.25, peak IPC = 4
  - Dependencies reduce this in practice
- "Out-of-Order" execution
  - Reorder instructions dynamically in hardware to reduce impact of hazards:
     EG, memory/cache misses
- CS152 discusses these techniques!



# Out Of Order Superscalar Processor

Computer Science 61C Fall 2021

Wawrzynek and Weaver



P&H p. 340



### Benchmark: CPI of Intel Core i7

Computer Science 61C Fall 2021 Wawrzynek and Weaver



CPI of Intel Core i7 920 running SPEC2006 integer benchmarks.



### And That Is A Beast...

Computer Science 61C Fall 2021 Wawrzynek and Weaven

- 6 separate functional units
  - 3x ALU
  - 3 for memory operations
- 20-24 stage pipeline
- Aggressive branch prediction and other optimizations
  - Massive out-of-order capability:
     Can reorder up to 128 micro-operation instructions!
- And yet it still barely averages a 1 on CPI!





## Pipelining and ISA Design

Computer Science 61C Fall 2021

Wawrzynek and Weaver

### RISC-V ISA designed for pipelining

- All instructions are 32-bits in the RV-32 ISA
  - Easy to fetch and decode in one cycle
    - Variant additions add 16b and 64b instructions,
       but can tell by looking at just the first bytes what type it is
  - Versus x86: 1- to 15-byte instructions
    - Requires additional pipeline stages for decoding instructions
- Few and regular instruction formats
  - Decode and read registers in one step
- Load/store addressing
  - Calculate address in 3rd stage, access memory in 4th stage
- Alignment of memory operands
  - Memory access takes only one cycle



#### In Conclusion

Computer Science 61C Fall 2021

Wawrzynek and Wea

- Pipelining increases throughput by overlapping execution of multiple instructions
- All pipeline stages have same duration
  - Choose partition that accommodates this constraint
- Hazards potentially limit performance
  - Maximizing performance requires programmer/compiler assistance
- Superscalar processors use multiple execution units for additional instruction level parallelism
  - Performance benefit highly code dependent

